Skip to content

fix: Fix deathlock during statically registering services#4657

Open
pavel-jares-bcm wants to merge 10 commits into
v3.x.xfrom
reboot/eureka-registry-deathlock-fix
Open

fix: Fix deathlock during statically registering services#4657
pavel-jares-bcm wants to merge 10 commits into
v3.x.xfrom
reboot/eureka-registry-deathlock-fix

Conversation

@pavel-jares-bcm
Copy link
Copy Markdown
Contributor

Description

When service is statically registered (since #4051) there is one additional lock in the class ApimlInstanceRegistry (see the synchronization removal). In case of multiple threads it could lead to service freezing on start-up during reading the static definition file.

Lock order of registerStatically:

  • org.zowe.apiml.discovery.ApimlInstanceRegistry#registerStatically : synchronized (lock)
  • com.netflix.eureka.registry.AbstractInstanceRegistry#register : read.lock();
  • com.netflix.eureka.registry.AbstractInstanceRegistry#register : synchronized (lock)
  • com.netflix.eureka.registry.AbstractInstanceRegistry#register : leave synchronized block, but it is still locked
  • com.netflix.eureka.registry.AbstractInstanceRegistry#register : read.unlock();
  • com.netflix.eureka.registry.AbstractInstanceRegistry#register : unlock synchronized block

Lock order of register:

  • com.netflix.eureka.registry.AbstractInstanceRegistry#register : read.lock();
  • com.netflix.eureka.registry.AbstractInstanceRegistry#register : synchronized (lock)
  • com.netflix.eureka.registry.AbstractInstanceRegistry#register : unlock synchronized block
  • com.netflix.eureka.registry.AbstractInstanceRegistry#register : read.unlock();

Deathlock:
thead1 (register): read.lock()
thead2 (registerStatically): synchronized (lock)
thread1 (register): is waiting for synchronized block to be available (to be done by thread2)
thread2 (registerStatically): is waiting for read.unlock() (to be issued by thread1)

Note: locks are used also by other methods but it should be the issue.

The source is the REST call from any south-bound service during the initialization of static services. The risk is higher when the static definition file or amount of service is bigger.

The additional lock (see the synchronized (lock) in registerStatically) was added because the original method register shouldn't change the property expectedNumberOfClientsSendingRenews. Using the same lock should prevent exclusivity to update the field. It is not correct solution because of orders of locks.

The new solution store information about the change (correction - the oposite operation) in a ThreadLocal, so the locks are not necessary now.

The similar issue was in cancel method as well. But it is not big deal as this method is used very rarelly.

Other changes in the PR becomes from analysis that using reflection is not needed at this version of Eureka (there were very probably changes in the overriden code). We can call the whole method now and it is not necessary to reimplement it via calling private methods via reflection.

Linked to # (issue)
Part of the # (epic)

Type of change

Please delete options that are not relevant.

  • fix: Bug fix (non-breaking change which fixes an issue)
  • feat: New feature (non-breaking change which adds functionality)
  • docs: Change in a documentation
  • refactor: Refactor the code
  • chore: Chore, repository cleanup, updates the dependencies.
  • BREAKING CHANGE or !: Breaking change (fix or feature that would cause existing functionality to not work as expected)

Checklist:

  • My code follows the style guidelines of this project
  • PR title conforms to commit message guideline ## Commit Message Structure Guideline
  • I have commented my code, particularly in hard-to-understand areas. In JS I did provide JSDoc
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • The java tests in the area I was working on leverage @nested annotations
  • Any dependent changes have been merged and published in downstream modules

For more details about how should the code look like read the Contributing guideline

Signed-off-by: Pavel Jareš <Pavel.Jares@broadcom.com>
Signed-off-by: Pavel Jareš <Pavel.Jares@broadcom.com>
Signed-off-by: Pavel Jareš <Pavel.Jares@broadcom.com>
Signed-off-by: Pavel Jareš <Pavel.Jares@broadcom.com>
Signed-off-by: Pavel Jareš <Pavel.Jares@broadcom.com>
@pull-request-size pull-request-size Bot added size/XL and removed size/L labels Jun 1, 2026
@pavel-jares-bcm pavel-jares-bcm marked this pull request as ready for review June 1, 2026 11:48
@pull-request-size pull-request-size Bot added size/L and removed size/XL labels Jun 1, 2026
@taban03
Copy link
Copy Markdown
Contributor

taban03 commented Jun 1, 2026

Not directly related to your PR, but are we sure that this calculation of adding a fixed +2 in getNumOfRenewsInLastMinis consistent in case of multiple static services registered? Also, could this affect the Eureka metrics (i.e. actuator) that may show more renewal than actually occurred?

@pavel-jares-bcm
Copy link
Copy Markdown
Contributor Author

Not directly related to your PR, but are we sure that this calculation of adding a fixed +2 in getNumOfRenewsInLastMinis consistent in case of multiple static services registered? Also, could this affect the Eureka metrics (i.e. actuator) that may show more renewal than actually occurred?

I am not sure, but I would say it make sense only for Modulith and the number is incorrect, there should be 3 for apiml (CS, Catalog, GW) and maybe also amount of statically registered services, but I don't consider it as part of this PR. Maybe we can create a issue for verification.

void givenStaticDefinition_whenFailRegistration_thenCorrectionAndExpectedNumberOfClientsSendingRenewsDidntChange() {
var renewCorrection = (ThreadLocal<Integer>) ReflectionTestUtils.getField(apimlInstanceRegistry, "RENEW_CORRECTION");
renewCorrection.set(123);
ReflectionTestUtils.setField(apimlInstanceRegistry, "expectedNumberOfClientsSendingRenews", 5);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these lines are repeated, so they can be set globally

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extracted to @beforeeach

Signed-off-by: Pavel Jareš <Pavel.Jares@broadcom.com>
@sonarqubecloud
Copy link
Copy Markdown

sonarqubecloud Bot commented Jun 1, 2026

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

Development

Successfully merging this pull request may close these issues.

2 participants